Abstract

We consider a planning problem where the dynamics and rewards
of the environment depend on a hidden static parameter referred
to as the context. The objective is to learn a strategy that
maximizes the accumulated reward across all contexts. The new
model, called Contextual Markov Decision Process (CMDP), can
model a customer’s behavior when interacting with a website
(the learner). The customer’s behavior depends on gender, age,
location, device, etc. Based on that behavior, the website’s
objective is to determine customer characteristics and to
optimize the interaction between them. Our work focuses on one
basic scenario: finite horizon with a small known number of
possible contexts. We suggest a family of algorithms with
provable guarantees that learn the underlying models and the
latent contexts, and optimize the CMDPs. Bounds are obtained
for specific naive implementations, and extensions of the
framework are discussed, laying the ground for future research.

1. Introduction 

Markov Decision Processes (MDPs) are commonly used to describe
dynamic behavior in multiple fields such as signal processing,
robotics, games, advertising, health and queue management
(Puterman, 2005; White, 1993). When multiple trajectories are
observed from a single source, a natural question is the
following: “Does each observed trajectory follow the same
transition probabilities?” When the answer is affirmative, these
transitions can be estimated through standard maximum likelihood
estimation (Boas, 2006), and many techniques exist for different
setups, most notably Hidden Markov Models (HMMs; Elliott et al.,
1995) in modeling and Partially Observed Markov Decision
Processes (POMDPs; Aberdeen, 2003) in control.

However, in many applications there are additional exogenous
variables that affect the model. We refer to these variables
collectively as the context. For example, the temporal behavior
of sugar levels for diabetes patients is largely influenced by
their age and gender. Similarly, humidity measurements are
greatly affected by the geographical location of the measurement
device. Since these context variables do not change within each
measurement, the standard solution of incorporating them into
the state, creating a much larger MDP or POMDP, seems faulty as
it reduces the generalizing power of the model. Specifically,
incorporating static features into the state forms distinct
unconnected dynamic chains. As the transition probability
between states with different contexts is always zero, a more
compact model would be separate transition matrices for each
context instead of one double sized matrix.
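
To make the comparison concrete, the following minimal sketch folds a static context into the state and checks that all cross-context entries are zero. The two toy transition matrices and the helper name `augment_with_context` are illustrative assumptions, not part of the model above; a single action is used for brevity.

```python
def augment_with_context(per_context):
    """Stack K per-context S x S transition matrices into one (K*S) x (K*S) matrix.

    Augmented state index = context * S + state. Because the context never
    changes within a trajectory, only the diagonal blocks are non-zero.
    """
    K = len(per_context)
    S = len(per_context[0])
    big = [[0.0] * (K * S) for _ in range(K * S)]
    for c, P in enumerate(per_context):
        for x in range(S):
            for y in range(S):
                big[c * S + x][c * S + y] = P[x][y]
    return big


# Two hypothetical contexts over S = 2 states (toy numbers).
P0 = [[0.9, 0.1], [0.2, 0.8]]
P1 = [[0.5, 0.5], [0.6, 0.4]]
big = augment_with_context([P0, P1])

# Count the entries that actually carry information.
nonzero = sum(1 for row in big for v in row if v != 0.0)
print(len(big), "x", len(big[0]), "matrix with", nonzero, "non-zero entries")
```

In this toy case the augmented matrix stores (KS)^2 = 16 entries, while the K separate matrices need only K S^2 = 8.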

1.1. Motivation for Contextual Dynamics 

A real world example for latent context learning is the problem
of identifying the user. Consider a large content website. Such
a website has two main activities: (a) suggesting relevant
content to its users and (b) presenting alluring ads for profit.
Current methodologies that determine the relevance of the
content and the ads require the user profile: age, gender,
income level, device, location, etc. Usually, in order to
determine whether a certain user is revisiting the website,
mechanisms such as (HTTP) cookies are used. But in many cases
these mechanisms are insufficient. What if the website does not
have any prior information about the user (also known as the
cold start problem; Kohrs & Merialdo 2001)? Can we learn the
user’s age or gender by observing his interaction with the
website? In other words, given a trajectory of the pages visited
by the user, can we predict (cluster or classify) the user’s
profile? And more importantly, can we take advantage of such
clustering and tailor the policy to the user?

This type of problem also exists in scenarios where we have
information about the owner of a device, but several users use
it and we want to identify them (such as children using their
parents’ tablets). In this work we suggest modeling the user
interaction in a Markovian fashion in order to identify the user
(Meyn & Tweedie, 2009).

A more elaborate scenario is when the user has been identified,
and we want to optimize the content and ads presented to him,
where the optimization criterion is, for instance, maximizing
the user’s time spent on the website. In such cases, we model
the interaction of a user as a Markov Decision Process where
different user groups may be modeled and optimized according to
their context. In on-line advertising, solutions to such
optimization problems are highly valuable, since the correct
identification of users leads to higher click through rates
(CTRs; Richardson et al. 2007). Hence, the ultimate goal is
on-line learning of an optimal control when both the context and
the model’s parameters are unknown. Notice that a sub-goal in
this case is the one described above: learning the underlying
Markov dynamics.

Our work’s main contribution is presenting a general algorithm
with provable guarantees for the finite horizon episodic
contextual MDP setup. Considering a specific implementation, we
provide regret analysis and empirical parametric sensitivity
analysis. Additionally, we discuss two applicative extensions of
the model: the case of infinitely many contexts, and the
concurrent Reinforcement Learning (RL) (Silver et al., 2013)
setting. The reader should bear in mind that the solutions
suggested are preliminary and our focus is on presenting the
problems along with their derived trade-offs, as well as setting
the bar for future research.

1.2. Related Literature 

Many previous works are related to the setup presented in this
paper. In Hidden Markov Models (HMMs; e.g., Elliott et al. 1995)
the Markovian state dynamics are latent, and the observed
samples are a transformation of the source’s output. Works by
Wilson & Bobick 1999 and Radenen & Artieres 2014 considered
adding context to HMMs; however, the context in their model only
affects the observation distribution and not the state dynamics.

A natural extension of HMMs to a control setting is POMDPs
(Aberdeen, 2003). CMDPs can be modeled using POMDPs by setting
the context variable to be the origin and each possible MDP as a
distinct outgoing chain. However, POMDPs are too general and
complex to capture the essence of the CMDP setup. In addition,
POMDPs usually assume an underlying distribution on the
contexts, which we refrain from doing.

The notion of context was borrowed from closely related works in
the Multi-Armed Bandits (MAB) literature (Sutton & Barto, 1998;
Bubeck, 2012), called Contextual-MAB (Langford & Zhang, 2007;
Lai & Robbins, 1985). The extension to the regular setting of
MAB is that before the learner plays his turn, a context is
presented to the user. Another similar paper is by Maillard &
Mannor (2014), describing a setup in which only the rewards
depend on an unobserved latent variable. They consider three
cases: the reward function and context are known, the reward
function is known but the context is not, and both are unknown.

Other related literature considers model selection in MDPs. Doya
et al. (2002) propose an architecture for multiple model-based
reinforcement learning (MMRL). Their approach decomposes a
complex task into multiple domains in time and space and uses a
responsibility signal to weigh the outputs of multiple models
and to gate the learning of the prediction models and
controllers. Hence, the responsibility signal measures how to
mix different models such that various areas in the state space
could be more easily modeled. A similar approach was used in
learning meta-parameters of motor skills (Kober et al., 2012).
In our work, we try to identify a single source that fits all
the space.

Another relevant problem is that of on-line representation
learning (Nguyen et al., 2013; Maillard et al., 2011), dealing
with finding the best state space while interacting with the
environment. Differently, in the CMDP setup all models share the
same state space. Finally, Kiseleva et al. (2013) consider
contextual MDPs, but from a purely applicative perspective: they
relate directly to web advertising by modeling user types.

We conclude with a short comparison of CMDPs with other 
known extensions of the MDP model: 

1. In Contextual HMMs the context affects only the observation
distribution, and there is no control.

2. POMDPs are a more complex structure generalizing CMDPs. Since
in our case the hidden parameters are constant over time, a
simpler solution might exist. In addition, POMDPs assume some
distribution over contexts.

3. In Multi-model RL the dynamics and rewards are composed of a
convex combination of several models, meaning that in each
trajectory there can be more than one valid model.

4. The problem of state representation is that of finding a
suitable state space for given observations. In our case the
state space is the same for all models, allowing more efficient
solutions.

5. Models described by robust MDPs (Nilim & El Ghaoui, 2005;
Wiesemann et al., 2013) consider uncertainty in the transitions
and rewards. CMDPs can be viewed as such, where the uncertainty
is not rectangular (around each state-action pair) but singular:
determining one transition sets all of them.

So essentially, a CMDP is simply a set of models sharing the
same state and action space.


1.3. Paper Structure 

The paper is organized as follows: In Section 2 we formally
define the CMDP setting, compare it to other models from the
literature and introduce the general setup. Section 3 describes
the problem in more detail and presents a general form algorithm
to solve it. One specific instance of the algorithm is analyzed,
and eventually some possible extensions are presented. In
Section 4, we provide experiments and discuss trade-offs in our
setup. Finally, in Section 5 we conclude and lay down future
directions.


2. Contextual Markov Decision Processes 


We begin with defining a standard Markov Decision Process (MDP;
Puterman 2005).

Definition 1. (MDP Setup) A Markov Decision Process is a tuple
$(S, A, p(y|x,a), r(x), \pi_0)$ where $S$ is the state space,
$A$ is the action space, $p(y|x,a)$ is the transition
probability ($y, x \in S$, $a \in A$), $r(x)$ is a reward
function, and $\pi_0$ is the initial state distribution.


Given a deterministic horizon $T$, the learner’s interaction is
as follows. At the beginning of each episode, an initial state
$x_0$ is chosen according to the state distribution $\pi_0$.
Afterwards, for each $0 \le t \le T$, the learner chooses an
action according to a policy $\mu(a_t|x_t)$ where $a_t \in A$,
$x_t \in S$. We note that the policy may be a random function.
The environment provides a reward $r(x_t)$ and the next state is
(randomly) chosen according to $p(x_{t+1}|x_t, a_t)$. In
general, the learner’s goal is to maximize the following value
function:

$$J^{\mu} = \mathbb{E}\left[ \sum_{t=0}^{T} r(x_t) \,\middle|\, x_0 \sim \pi_0, \mu(a|x) \right],$$

where the expectation is taken over trajectories with respect to
the policy $\mu(a|x)$ and the initial distribution $\pi_0$.
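
As a small illustration of this value function, the sketch below evaluates the expected sum of rewards over steps t = 0, ..., T exactly, by propagating the state distribution forward. The function name, the nested-list layout and the two-state toy MDP are assumptions made for the example only.

```python
def finite_horizon_value(pi0, mu, p, r, T):
    """Exact expected sum of r(x_t) for t = 0..T under a fixed policy.

    pi0[x]      : initial state distribution
    mu[x][a]    : probability that the policy picks action a in state x
    p[x][a][y]  : transition probability to y from x under a
    r[x]        : reward in state x
    """
    S = len(pi0)
    A = len(mu[0])
    dist = list(pi0)
    value = 0.0
    for _ in range(T + 1):                      # steps t = 0, ..., T
        value += sum(dist[x] * r[x] for x in range(S))
        nxt = [0.0] * S
        for x in range(S):
            for a in range(A):
                for y in range(S):
                    nxt[y] += dist[x] * mu[x][a] * p[x][a][y]
        dist = nxt
    return value


# Toy MDP (hypothetical): action 0 stays in place, action 1 switches state.
p = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [1.0, 0.0]]]
r = [0.0, 1.0]
pi0 = [1.0, 0.0]

always_switch = [[0.0, 1.0], [0.0, 1.0]]
uniform = [[0.5, 0.5], [0.5, 0.5]]
J_switch = finite_horizon_value(pi0, always_switch, p, r, T=3)
J_uniform = finite_horizon_value(pi0, uniform, p, r, T=3)
print(J_switch, J_uniform)
```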


When the MDP parameters are given, the problem of finding the
policy which maximizes cumulative reward is known in the
literature as planning (Puterman, 2005; Bertsekas & Tsitsiklis,
1995). When the MDP parameters are unknown in advance, finding
the best policy is known as Adaptive Control or Reinforcement
Learning (RL; Puterman 2005; Bertsekas & Tsitsiklis 1995).


The following definition establishes the extended model
considered in the paper, denoted by Contextual MDPs.

Definition 2. A Contextual Markov Decision Process (CMDP) is a
tuple $(\mathcal{C}, S, A, \mathcal{M}(c))$ where $\mathcal{C}$
is called the context space, $S$ and $A$ are the state and
action space correspondingly, and $\mathcal{M}$ is a function
mapping any context $c \in \mathcal{C}$ to an MDP
$\mathcal{M}(c) = (S, A, p^c(y|x,a), r^c(x), \pi_0^c)$.


The simplest scenario in a CMDP setting is when the context is
simply observable. In this setting, the problem reduces to
correctly generalizing the model from the context. If the
context space is finite with $|\mathcal{C}| = K$, then with no
further assumption, one can simply learn $K$ different models.

An interesting problem arises when $K$ scales with the number of
sampled trajectories. For instance, consider the problem of
targeted advertising: Given behavioral patterns and side
information of many customers, companies usually seek to group
the consumers so they can target their needs and habits. Since
side information usually resides in a very large set (for
example, the cross-product of gender, age, etc.), in practice it
is aggregated, where the number of clusters depends on the
amount of available data.

The model aggregation problem is not considered in this work,
and instead we focus on latent contexts for the rest of the
paper. Additionally, we assume the initial state distribution
and rewards are context independent, maintaining the hardness of
the problem while greatly simplifying the writing. Finally, we
adopt the common [0, 1]-bounded reward assumption.

2.1. General Setup 

We define the general setup as follows: The context space
consists of $K$ possible contexts. The time axis is divided into
$H$ episodes, denoted by $e_1, \ldots, e_H$. In the beginning of
each episode, the environment chooses a context
$c \in \mathcal{C}$ (in a random, adversarial or any other
fashion). Afterwards, an initial state is randomly chosen
according to an initial state distribution $\pi_0^c$. A
trajectory of length $T$ is generated, where $T$ is a stopping
time (Meyn & Tweedie, 2009). Then, for the chosen MDP an
interaction as described in Definition 1 is applied until the
end of the trajectory.
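
The episode protocol can be sketched as follows, under simplifying assumptions: a CMDP is stored as a dictionary from context to a per-context MDP, the trajectory length is a fixed `T` rather than a general stopping time, and the names `sample`, `run_episode` and the toy "stay" and "switch" contexts are hypothetical.

```python
import random


def sample(dist, rng):
    """Draw an index from a discrete distribution given as a list of probabilities."""
    u, acc = rng.random(), 0.0
    for i, q in enumerate(dist):
        acc += q
        if u < acc:
            return i
    return len(dist) - 1  # guard against floating point rounding


def run_episode(cmdp, context, policy, T, rng):
    """Generate one trajectory of (x_t, a_t, x_{t+1}) triples from the MDP of `context`."""
    mdp = cmdp[context]
    x = sample(mdp["pi0"], rng)
    traj = []
    for _ in range(T):
        a = sample(policy[x], rng)
        y = sample(mdp["p"][x][a], rng)
        traj.append((x, a, y))
        x = y
    return traj


# Toy CMDP with K = 2 contexts sharing S = 2 states and A = 2 actions.
stay = [[[1.0, 0.0], [1.0, 0.0]], [[0.0, 1.0], [0.0, 1.0]]]
switch = [[[0.0, 1.0], [0.0, 1.0]], [[1.0, 0.0], [1.0, 0.0]]]
cmdp = {0: {"pi0": [0.5, 0.5], "p": stay},
        1: {"pi0": [0.5, 0.5], "p": switch}}
uniform_policy = [[0.5, 0.5], [0.5, 0.5]]

rng = random.Random(0)
traj_stay = run_episode(cmdp, 0, uniform_policy, T=10, rng=rng)
traj_switch = run_episode(cmdp, 1, uniform_policy, T=10, rng=rng)
```

The learner sees only the trajectories; the `context` argument is hidden from it in the latent setting.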

3. Problem Definition and Solution 

We assume a small finite $\mathcal{C}$, and that $T$ is bounded
almost surely, denoting this setup as finite sources episodic
CMDP. The goal is maximizing the cumulative rewards from all
trajectories by the $H$’th trajectory, for increasing $H$.
Therefore, we measure performance with respect to $H$. Ideally,
a good policy should optimize the trade-off between exploration
and exploitation of the current chain. However, unlike the
standard RL setup, the exploration in this case should consider
not only the model’s parameters, but also the hidden context.

We measure our performance with the notion of regret: the
difference between our cumulative reward and the cumulative
reward obtained by an agent satisfying some optimality property.
For example, in infinite horizon RL the cumulative discounted
reward is compared against an agent with knowledge of the true
model who can therefore start from the optimal policy (Auer et
al., 2009); the faster the regret bound converges to 0 with $T$
the better.

Similarly, we compare ourselves to the all knowing agent
applying the optimal policy for the correct context at each
trajectory. In our setup, since $T$ is bounded the regret is
evaluated mainly with respect to the number of trajectories $H$.
Notice though, that in each new trajectory some loss is
guaranteed until the correct context is identified. Therefore,
the regret will always be linear in $H$. A different optimal
agent, when there is some prior distribution over contexts, can
be chosen to perform the solution of the resulting POMDP, but
there may be other appropriate choices. The problem of
redefining the regret to obtain more meaningful bounds is left
for future research.

Definition 3. For the problem of finite sources episodic CMDP we
define the regret over $H$ trajectories to be:

$$\text{Regret} = \sum_{h=1}^{H} J_h^{*} - \sum_{h=1}^{H} \sum_{t=1}^{T_h} r_{h,t}, \qquad (1)$$

where $J_h^{*}$ is the optimal value function in $T_h$ steps for
the context chosen in the $h$’th trajectory, and $r_{h,t}$ is
the reward obtained by the agent in the $h$’th trajectory at the
$t$’th step.
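
As a worked instance of this definition, the following sketch computes the regret from a list of per-trajectory optimal values and a list of per-trajectory reward sequences. The function name and the numbers are illustrative assumptions.

```python
def regret(optimal_values, rewards):
    """Sum of per-trajectory optimal values minus the sum of all collected rewards.

    optimal_values[h] : optimal value for the context of trajectory h
    rewards[h][t]     : reward collected at step t of trajectory h
    """
    if len(optimal_values) != len(rewards):
        raise ValueError("need one optimal value per trajectory")
    return sum(optimal_values) - sum(sum(r_h) for r_h in rewards)


# Two hypothetical trajectories of different lengths.
example = regret([3.0, 2.0], [[1.0, 1.0, 0.0], [0.0, 1.0]])
print(example)
```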

In order to solve the problem of regret minimization we
introduce the CECE general framework
(Cluster-Explore-Classify-Exploit) that partitions the
trajectories into mini-batches. In the beginning of each
mini-batch, all previously seen trajectories are used to form
$K$ distinct models through Algorithm 1 (Cluster). Then, for
each new trajectory in the current mini-batch the agent
generates a partial trajectory using Algorithm 2 (Explore). The
partial trajectory is then classified to a context by Algorithm
3 (Classify). Finally, Algorithm 4 sets the policy for the
remainder of the trajectory (Exploit). In summary:


Alg. 1: Cluster observed trajectories to K models.

Alg. 2: Explore the context.

Alg. 3: Classify partial trajectory to model.

Alg. 4: Exploit the identified model.
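
The control flow of the four steps can be sketched as a loop with pluggable components. This is only a skeleton under our own assumptions: the function `cece`, its argument names and the stub components below are hypothetical, and how the first mini-batch is handled (when no trajectories have been seen yet) is left to the `cluster` component.

```python
def cece(num_batches, batch_size, K, cluster, explore, classify, exploit):
    """Run the Cluster-Explore-Classify-Exploit loop and return all trajectories."""
    history = []
    for _ in range(num_batches):
        models = cluster(history, K)              # Alg. 1: refit K models per mini-batch
        for _ in range(batch_size):
            partial = explore(models)             # Alg. 2: partial trajectory
            k = classify(partial, models)         # Alg. 3: pick a model index
            rest = exploit(models[k], partial)    # Alg. 4: remainder of the trajectory
            history.append(partial + rest)
    return history


# Stub components that only record the call pattern.
calls = []

def stub_cluster(history, K):
    calls.append(("cluster", len(history)))
    return ["m%d" % k for k in range(K)]

history = cece(
    num_batches=2, batch_size=3, K=2,
    cluster=stub_cluster,
    explore=lambda models: [0],
    classify=lambda partial, models: 1,
    exploit=lambda model, partial: [model],
)
print(calls, len(history))
```

With the stubs, clustering runs once per mini-batch and sees 0 and then 3 previously collected trajectories.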

The following assumptions and theorem guarantee CECE’s 
performance: 

Definition 4. 1. Let:

$$M_1 = (S, A, p_1(y|x,a), r(x), \pi_0),$$

$$M_2 = (S, A, p_2(y|x,a), r(x), \pi_0)$$

be two MDPs with the same state space, action space, rewards and
initial state distribution. We define $M_2$ to be an
$\epsilon$-approximated model of $M_1$ if for every state-action
pair $(s, a) \in S \times A$:

$$\|\Pr_1(\cdot|s,a) - \Pr_2(\cdot|s,a)\|_1 < \epsilon. \qquad (3)$$

2. Let:

$$X_1 = (\mathcal{C}_1, S, A, \mathcal{M}_1(c)),$$

$$X_2 = (\mathcal{C}_2, S, A, \mathcal{M}_2(c))$$

be two CMDPs with the same state and action space satisfying
$|\mathcal{C}_1| = |\mathcal{C}_2|$. We define $X_2$ to be an
$\epsilon$-approximated CMDP of $X_1$ if there exists a matching
between the contexts
$f : \mathcal{C}_1 \leftrightarrow \mathcal{C}_2$ such that for
every $c \in \mathcal{C}_1$ we have that $\mathcal{M}_2(f(c))$
is an $\epsilon$-approximated model of $\mathcal{M}_1(c)$.
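
Both parts of the definition can be checked mechanically for small problems. The sketch below is a brute-force illustration under our own assumptions: transition models are nested lists `p[s][a][y]`, the matching is searched over all permutations (feasible only for small K), and the helper names are hypothetical.

```python
from itertools import permutations


def l1(u, v):
    """L1 distance between two distributions given as lists."""
    return sum(abs(a - b) for a, b in zip(u, v))


def model_gap(p1, p2):
    """Largest L1 distance between next-state distributions over all (s, a) pairs."""
    return max(l1(p1[s][a], p2[s][a])
               for s in range(len(p1)) for a in range(len(p1[s])))


def is_eps_approx_cmdp(models1, models2, eps):
    """True if some one-to-one matching of contexts keeps every model gap below eps."""
    if len(models1) != len(models2):
        return False
    idx = range(len(models1))
    return any(all(model_gap(models1[c], models2[f[c]]) < eps for c in idx)
               for f in permutations(idx))


# Toy models over 2 states and 1 action; the second list is a perturbed,
# re-ordered copy of the first.
pA = [[[0.9, 0.1]], [[0.2, 0.8]]]
pB = [[[0.5, 0.5]], [[0.5, 0.5]]]
pA_est = [[[0.88, 0.12]], [[0.2, 0.8]]]
print(is_eps_approx_cmdp([pA, pB], [pB, pA_est], eps=0.05))
```

The search over matchings reflects that estimated models carry no context labels, so they can only be compared to the true ones up to a relabeling.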

Assumption 1. Let $H_0$ be some constant number of trajectories.
For every $H > H_0$ there exist $\delta_1(H), \epsilon(H) > 0$
such that after applying Algorithm 1 on $H$ trajectories, with
probability at least $1 - \delta_1(H)$ the estimated $K$ models
form an $\epsilon(H)$-approximated CMDP of the true CMDP.

Assumption 1 guarantees that having enough trajectories will
drive Algorithm 1 to output an approximated model for each
context. It embeds a hidden assumption that all contexts were
observed enough times. Since there is some probability of error
$\delta_1$, the clustering procedure must be repeated when more
trajectories are presented to ensure diminishing regret; that is
the reason a mini-batch scheme is applied.

Assumption 2. For every $\epsilon > 0$, there exists
$\delta_2(\epsilon)$ such that given an $\epsilon$-approximated
CMDP, after applying Algorithms 2 and 3 the correct context is
identified with probability at least $1 - \delta_2(\epsilon)$.
In addition, the number of steps taken is a stopping time
denoted by $T_{EC}$.

This assumption assures us each trajectory will be classified
correctly with high probability, which will guarantee good
performance for exploitation in the next step. Moreover,
$T_{EC}$ represents the number of samples needed to
differentiate between the models.

Assumption 3. Given an $\epsilon$-approximated model, Algorithm
4 obtains Regret $< C(\epsilon)$.

Assumption 3 establishes the regret provided by Algorithm 4 when
the models are well-approximated.

Theorem 1. Let $H_i$ be the number of trajectories in the
$i$’th mini-batch. Then if Assumptions 1, 2, 3 hold, CECE
achieves in the $L$’th mini-batch:

$$\text{Regret} < (1 - \delta_1) H_L \left( \delta_2 \mathbb{E}T + (1 - \delta_2)(C + \mathbb{E}T_{EC}) \right) + \delta_1 H_L \mathbb{E}T, \qquad (5)$$

where $\delta_1 = \delta_1(H)$, $\epsilon = \epsilon(H)$,
$\delta_2 = \delta_2(\epsilon)$, $C = C(\epsilon)$.

The proof is a straightforward combination of the given
assumptions.

3.1. Discussion 

Notice that in order for Assumption 1 to hold with a meaningful
$\epsilon$, when $H_1$ is set each model must be observed
sufficiently. This fact should be added as an additional
assumption depending on the specific realization of Algorithm 1.
Supposedly the subsequent $H_i$’s can be chosen arbitrarily
small, utilizing information from new trajectories as soon as it
is available. Yet, Algorithm 1 may be computationally expensive,
making larger $H_i$’s preferable in practice. Another possible
approach to this trade-off is to apply on-line clustering (Ailon
et al., 2009).

In essence, Algorithm 1 is a form of Multiple Model Learning
(MML) algorithm (Vainsencher et al., 2013): each trajectory is a
sample from an unknown model (context) and the goal is learning
all models simultaneously. It could also be reduced to the
clustering problem, where each trajectory is represented as an
$S \times S \times A$ vector of its empirical transition matrix.
Indeed, some information is lost in this process: the number of
samples from each $(s, a)$ pair in the trajectory is ignored
despite its effect on the variance around the sampled
distribution. So, ideally each trajectory should be reduced to a
point with varying variance across dimensions, which gets
smaller for longer trajectories.
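
A minimal sketch of this reduction follows: a trajectory of (state, action, next state) triples is turned into an empirical transition array and flattened into a feature vector, with the visit counts returned separately since they are what the plain vector loses. The helper names, the state-action-next-state ordering of the vector, and the uniform fill-in for unvisited pairs are our own assumptions.

```python
def empirical_transitions(traj, S, A):
    """Return (P_hat, visits) for a trajectory of (s, a, y) triples.

    P_hat[s][a][y] is the empirical next-state frequency and visits[s][a] the
    number of times (s, a) was sampled. Unvisited pairs default to a uniform
    row, which is an arbitrary choice made for this sketch.
    """
    counts = [[[0] * S for _ in range(A)] for _ in range(S)]
    for s, a, y in traj:
        counts[s][a][y] += 1
    P_hat, visits = [], []
    for s in range(S):
        P_hat.append([])
        visits.append([])
        for a in range(A):
            n = sum(counts[s][a])
            visits[s].append(n)
            P_hat[s].append([c / n for c in counts[s][a]] if n else [1.0 / S] * S)
    return P_hat, visits


def vectorize(P_hat):
    """Flatten the S x A x S array into a single feature vector for clustering."""
    return [v for per_state in P_hat for per_action in per_state for v in per_action]


traj = [(0, 0, 1), (1, 0, 0), (0, 0, 1), (1, 1, 1)]
P_hat, visits = empirical_transitions(traj, S=2, A=2)
features = vectorize(P_hat)
print(features, visits)
```

Two trajectories with the same `features` but very different `visits` would be treated identically by a plain clustering algorithm, which is the information loss noted above.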

Subsequently, one may question whether $\epsilon(H)$ can
converge to 0 for infinitely many trajectories. In our setup, as
$T$ grows the trajectories are more distinct, but $T$ is bounded
almost surely. So even for large $T$’s, there would be at least
some constant portion of the trajectories acting as outliers of
the model they originated from, possibly tainting the clusters.
One way to solve this issue is through an outlier robust
clustering (for example K-median; Har-Peled & Mazumdar 2004).

Next, consider the effect of the trajectory length $T$ on the
hardness of the problem. When $T$ is very large, it is much more
important to recognize the correct model. Since Algorithm 4
(exploitation) is applied for a longer duration, it could
include an exploratory part to obtain a better model while
running the trajectory, in addition to shielding against
wrongful classification.

The other extreme case is when $T$ is too small to determine the
correct model with high probability. Assuming the models can
still be approximated, one reasonable solution would be to try
and optimize the worst case performance over all models. This
approach is closely related to the problem of Robust MDPs (Nilim
& El Ghaoui, 2005), a formulation of MDPs with uncertainty in
the transitions and rewards. When the uncertainty set is
rectangular an efficient solution exists. However, in our case
it is singular: setting one transition probability is the same
as setting the context along with its related transition matrix;
thus the problem is intractable (Wiesemann et al., 2013).

When all trajectories are short, it might be impossible to
provide an approximation of the true models. Consider for
example the extreme case where only one transition is given:
unless there is a stationary distribution over contexts, the
models can be neither learned nor optimized. Subsequently,
varied $T$ lengths pose another question: how confident are we
in the clustering of each trajectory? Embedding short
trajectories might inject more noise into the clustering process
than improve it, so some selection is needed to ensure proper
modeling. This question may relate to the notion of cluster
separability (Ostrovsky et al., 2006): short trajectories can
lead to non-separable models that cannot be learned through
clustering.

A rather simple realization of Algorithm 2 (exploration) is to
apply a fixed policy until some condition is fulfilled. One may
ask which policy achieves this condition in as few steps as
possible (since the regret is linear in the number of
exploration steps).

For instance, if there are only two models a logical approach
would be to choose actions maximizing the distinction between
the models. However, this is non-optimal as actions have future
consequences: a distinctive action for one state could lead the
agent to an area of the state space which is very similar
between the models.

A follow-up idea is using the original state and action space,
and reshaping the rewards to award actions for distinguishing
between the models. However, this solution is still problematic
since the underlying transition probabilities are unknown and
could be those of any of the possible models. Hence, finding a
good exploration policy is an open question we hypothesize to be
as difficult as solving a singular Robust MDP.

Finally, consider the effect of increasingly more possible
contexts. These increase both the size of the initial $H_1$
required for clustering, and the number of samples needed for
model identification $T_{EC}$. The case of infinitely many
models requires some changes in the algorithm, as discussed in
the end of this section.

3.2. A Specific Instance 

We present an example for an instance of CECE and substitute it
in Assumptions 1, 2, 3. For simplicity, we assume the trajectory
length is a constant $T$ for the remainder of the analysis. The
proposed realization was chosen to be trivial to allow simple
analysis; it is only a demonstration of the trade-offs in CMDPs
and CECE’s modularity.

Algorithm 1 is the following scheme:

1. For each trajectory $h$, and state action pair $(s, a)$,
estimate the transition probability $\Pr_h(\cdot|s, a)$ by its
empirical distribution.

2. Go over all possible partitions of trajectories to $K$ sets
and minimize over the following score:

$$\sum_{k=1}^{K} \max_{h \in C_k} \|\Pr_h(\cdot|s,a) - \Pr_k(\cdot|s,a)\|_1, \qquad (6)$$

where $\Pr_k(\cdot|s, a)$ is the estimated transition
probability for all trajectories in the cluster.
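
A brute-force sketch of this scheme is given below for very small inputs. Several choices are our own assumptions rather than part of the scheme as stated: each trajectory is summarized by a count array `counts[s][a][y]`, the cluster estimate pools the counts of its members, the score also takes the maximum over state-action pairs, every cluster must be non-empty, and unvisited pairs are filled with a uniform row.

```python
from itertools import product


def normalize(counts):
    """Turn next-state counts for every (s, a) into frequencies (uniform if unvisited)."""
    out = []
    for per_state in counts:
        row = []
        for c in per_state:
            n = sum(c)
            row.append([v / n for v in c] if n else [1.0 / len(c)] * len(c))
        out.append(row)
    return out


def gap(p, q):
    """Max over (s, a) of the L1 distance between next-state distributions."""
    return max(sum(abs(u - v) for u, v in zip(p[s][a], q[s][a]))
               for s in range(len(p)) for a in range(len(p[s])))


def pooled(counts_list):
    """Add up the count arrays of all trajectories in one cluster."""
    total = [[list(c) for c in per_state] for per_state in counts_list[0]]
    for counts in counts_list[1:]:
        for s, per_state in enumerate(counts):
            for a, c in enumerate(per_state):
                for y, v in enumerate(c):
                    total[s][a][y] += v
    return total


def exhaustive_cluster(all_counts, K):
    """Try every assignment of trajectories to K non-empty clusters.

    Returns (best_score, labels); the search costs K**H score evaluations.
    """
    best = (float("inf"), None)
    for labels in product(range(K), repeat=len(all_counts)):
        if len(set(labels)) < K:
            continue
        score = 0.0
        for k in range(K):
            members = [c for c, l in zip(all_counts, labels) if l == k]
            center = normalize(pooled(members))
            score += max(gap(normalize(m), center) for m in members)
        if score < best[0]:
            best = (score, labels)
    return best


# Four toy trajectories over 2 states and 1 action: two "stay"-like, two "switch"-like.
t0 = [[[5, 0]], [[0, 5]]]
t1 = [[[4, 1]], [[0, 5]]]
t2 = [[[0, 5]], [[5, 0]]]
t3 = [[[1, 4]], [[5, 0]]]
best_score, best_labels = exhaustive_cluster([t0, t1, t2, t3], K=2)
print(best_score, best_labels)
```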

This scheme is highly inefficient as it performs an exhaustive
search for the best partition. However, as a preliminary result
all we require is for it to satisfy Assumption 1. There are
other polynomial time clustering algorithms with guarantees
(Ostrovsky et al. 2006; Arthur & Vassilvitskii 2007 for
instance), but their bounds and assumptions would have to be
adjusted to our case.

In Algorithm 2, the uniform policy over actions is applied for a
constant number of steps $T_{EC}$. As mentioned above, this
procedure could be improved. For one, the total number of steps
could be decided on-line according to the confidence. Moreover,
there might be other exploration policies that could produce
faster identification of the true model, or even combine
exploitation in the strategy to generate overall smaller regret.

The proposed Algorithm 3 chooses the model obtaining the
smallest $L_1$ distance between the set of models and the
empirical transition matrix from the partial trajectory. Other
possible methods include maximum likelihood, weighted $L_1$ or
$L_2$ distance, and methods taking into account the cost of
choosing a wrong model.

Lastly, Algorithm 4 was chosen naively to apply the exploitation
policy with regards to the estimated model. A more sophisticated
approach would be to consider an RL algorithm whose regret with
respect to $T$ goes to 0. Since in our scenario $T$ is constant,
the suggested solution is satisfactory.
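
A nearest-model classifier of this kind can be sketched as follows. The aggregation is an assumption on our part: the $L_1$ distances are summed over the state-action pairs that the partial trajectory actually visited, and the function name and toy models are hypothetical.

```python
def classify(partial_traj, models):
    """Return the index of the model closest to the empirical transitions.

    partial_traj : list of (s, a, y) triples from the exploration phase
    models[k]    : transition array p[s][a][y] of the k-th estimated model
    Distance     : sum over visited (s, a) of the L1 distance between the
                   empirical and the model next-state distributions.
    """
    counts = {}
    for s, a, y in partial_traj:
        nxt = counts.setdefault((s, a), {})
        nxt[y] = nxt.get(y, 0) + 1

    def distance(p):
        total = 0.0
        for (s, a), nxt in counts.items():
            n = sum(nxt.values())
            total += sum(abs(nxt.get(y, 0) / n - p[s][a][y])
                         for y in range(len(p[s][a])))
        return total

    return min(range(len(models)), key=lambda k: distance(models[k]))


# Two toy models over 2 states and 1 action.
stay = [[[1.0, 0.0]], [[0.0, 1.0]]]
switch = [[[0.0, 1.0]], [[1.0, 0.0]]]
print(classify([(0, 0, 1), (1, 0, 0), (0, 0, 1)], [stay, switch]))
```

Ties are broken in favor of the lower model index, which is a side effect of `min` and not a deliberate design choice.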

We can now quote the necessary assumptions and resulting
Corollary:

Assumption 4. Let $\alpha, \beta \in (0,1)$.

1. By the $H$’th trajectory, each model was sampled at least
$\beta H$ times.

2. For some $D$, for every two contexts $c_1, c_2$ and $s, a$:
$\|\Pr_{c_1}(\cdot|s,a) - \Pr_{c_2}(\cdot|s,a)\| > D$.

3. In every trajectory, each state-action pair is visited at
least $\alpha T$ times, and $T$ is large enough: $T \in$

The first part guarantees each model is sampled enough times for
the classification to converge. The second part provides a
constant difference between the models, such that with enough
data the estimated models will be separable. The last part of
the assumption is needed to make sure there are enough samples
in each trajectory to learn the model. It can be guaranteed by
requiring $T_{EC}$ to be long enough, assuming that the induced
MDP is ergodic under the uniform policy.

Lemma 1. If Assumption 4 holds, the described realization 
of Algorithms 1-4 satisfy Assumptions 1-3 with: 

e{H) G 0{KSAe^-°‘'^^"), 

(H) G 

52 (e) G D>2e 

C(e) G OiS^T^e). 

The full proof is available in Section C of the supplementary
material.

Corollary 1. If Assumption 4 holds, the described realization of
Algorithms 1-4 achieves in the $L$’th mini-batch:

Regret <0{HLTKe^-'^^^^^ 

+ 0{HLT'^KS^Ae^-^^°" + HlTec) (8) 

-f 0{HLTKSAe^-°‘'^^'^^^), 

where H = Ylfji 

Notice that each summand relates to a different error:

1. The first summand corresponds to trajectory
misclassification. It can point us to a proper choice of
$T_{EC}$, scaled with $S$ and the distance between models.

2. The second summand corresponds to the context and model
uncertainty. Large $T$ and $\alpha$ are required to estimate
each model well enough.

3. The third summand corresponds to trajectory misclustering. It
is the only error which diminishes with $H$, as the exponential
multiplicative factor converges to 0.

3.3. Extensions 

There are other interesting extensions to the previous setup
exhibiting different trade-offs. For one, consider the more
complicated scenario when there is an infinite or unknown number
of models. CECE’s mini-batch solution can be adjusted to this
case by adding a probability to reject all models in Algorithm
3, but the clustering step will be much harder to evaluate in
this case. Consequently, regret analysis requires a more precise
setup, for example a bounded ratio between the number of
contexts and trajectories, or some distribution over contexts.

A more natural setup in web advertising applications is the
concurrent RL setup (Silver et al., 2013). Assume the agent
interacts with multiple infinite horizon trajectories, where at
each time step one trajectory (which may be new) requires an
action. In the CMDP setup, each trajectory originates from a
different latent context. The performance in this case should
take into account both the length and number of trajectories.

A rather naive solution would be to employ some RL algorithm
(for example, Q-learning; Watkins & Dayan 1992) in every
trajectory, regardless of the other trajectories. This approach
ignores information on the model obtained from other
trajectories sharing the same context. Thus, if there are many
short trajectories it could produce high regret.

A different solution is applying some variation of CECE’s
scheme: in each time step in a trajectory, first cluster
(Algorithm 1), and then either (a) choose an option which
explores the context (Algorithm 2), or (b) classify the partial
trajectory (Algorithm 3) and choose an action exploiting the
context (Algorithm 4). Even though the trajectory length is
unbounded, as long as more model samples are obtained from other
trajectories the error in the exploitation phase decreases.
Actual regret bounds for both approaches depend on the
parameters and assumptions of the specified problem. When there
are few long trajectories the first independent RL approach
would prevail (with regret of $O(H\sqrt{T})$; Auer et al. 2009),
while many shorter trajectories are better dealt with by a CECE
variant (with regret of $O(H T_{EC} + \sqrt{THK})$ for equal
probability contexts).

4. Experiments 

In this section we discuss the trade-offs that exist in the
CMDP setting. In the first experiment we test only the
clustering part of CECE. We consider a CMDP with K = 5
equal-probability contexts, |A| = 2 actions and |S| = 100
states, where the transition matrix for each context was
drawn from a uniform distribution. We generate H trajectories
of a constant length T, sampling actions uniformly.
For the purpose of scoring the clusters, we calculate, for
each true context, the entropy of the distribution over
clusters, and average the results weighted by the number
of samples from that context. Thus, when the trajectories
are perfectly clustered, the entropy for each context is 0
and so is the average. The worst possible score, log(K),
results from independent clusters and contexts. The
clustering algorithm we used in this case was K-means
(Duda et al., 2012) on the vectorized empirical transition
matrices. The results were averaged over 100 trials, with
error bars of one standard deviation.
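The scoring rule described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: the function name and the input format (one true context label and one cluster label per trajectory) are hypothetical, and natural logarithms are used.

```python
import math
from collections import Counter, defaultdict


def clustering_entropy_score(true_contexts, cluster_labels):
    """Average entropy of the cluster labels within each true context,
    weighted by the number of trajectories from that context."""
    if len(true_contexts) != len(cluster_labels):
        raise ValueError("one cluster label is required per trajectory")
    labels_by_context = defaultdict(list)
    for context, label in zip(true_contexts, cluster_labels):
        labels_by_context[context].append(label)
    total = len(true_contexts)
    score = 0.0
    for labels in labels_by_context.values():
        n = len(labels)
        # Entropy of the empirical distribution over clusters for this context.
        entropy = -sum((c / n) * math.log(c / n) for c in Counter(labels).values())
        score += (n / total) * entropy
    return score
```

A perfect clustering gives a score of 0, while cluster labels spread uniformly over K clusters independently of the context give log(K).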

We examined the following: (1) How long should trajectories
be to obtain favorable clustering? (2) How does the quality
of the clustering depend on the number of episodes, for
various trajectory lengths? In the first part of the
experiment (top plot in Figure 1) we generate H = 100
trajectories and present the score as a function of the
trajectory length T. In the second part of the experiment
(bottom plot in Figure 1), we generate trajectories of
lengths T = 2000, 5000, 8000 and measure the score as a
function of the number of episodes H.
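The data-generation step of this experiment can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names are hypothetical, "drawn from a uniform distribution" is interpreted here as normalizing i.i.d. uniform draws in each row, and unvisited state-action pairs fall back to a uniform row. Both choices are assumptions.

```python
import random


def random_transition_tensor(num_states, num_actions, rng):
    """P[s][a] is a next-state distribution built by normalizing uniform draws
    (an assumed reading of 'drawn from a uniform distribution')."""
    tensor = []
    for _ in range(num_states):
        rows = []
        for _ in range(num_actions):
            weights = [rng.random() for _ in range(num_states)]
            total = sum(weights)
            rows.append([w / total for w in weights])
        tensor.append(rows)
    return tensor


def sample_trajectory(P, length, rng):
    """Roll out `length` transitions with uniformly random actions.
    Returns a list of (state, action, next_state) triples."""
    num_states, num_actions = len(P), len(P[0])
    state = rng.randrange(num_states)
    transitions = []
    for _ in range(length):
        action = rng.randrange(num_actions)
        next_state = rng.choices(range(num_states), weights=P[state][action])[0]
        transitions.append((state, action, next_state))
        state = next_state
    return transitions


def empirical_transition_vector(transitions, num_states, num_actions):
    """Vectorized empirical transition estimate for one trajectory.
    Unvisited (state, action) pairs get a uniform row (an assumption)."""
    counts = [[[0] * num_states for _ in range(num_actions)]
              for _ in range(num_states)]
    for s, a, s_next in transitions:
        counts[s][a][s_next] += 1
    vector = []
    for s in range(num_states):
        for a in range(num_actions):
            visits = sum(counts[s][a])
            if visits == 0:
                vector.extend([1.0 / num_states] * num_states)
            else:
                vector.extend(c / visits for c in counts[s][a])
    return vector
```

Each trajectory then yields one vector of length |S|·|A|·|S|, and these vectors are the points handed to the clustering algorithm (K-means in the experiment above).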


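Figure 1. Experiment 1. Top: clustering score as a function of the trajectory length T. Bottom: clustering score as a function of H, the number of episodes.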


We draw the following conclusions: (1) There is a phase 
transition in the clustering performance with respect to T: 
below a certain threshold (here T = 4000) the clustering 
utterly fails, followed by a short adjustment period, where 
finally (here at T = 8000) the clustering succeeds almost 
certainly. (2) If the trajectories are too short, the clustering 
will fail even when increasing the number of episodes. (3) 
If the trajectories are sufficiently long, additional episodes 
improve the clustering quality (as implied by Lemma 1). 

Next, we experimented with the full CECE algorithm. We
simulated a CMDP with |S| = 100 states, |A| = 4 actions
and K = 20 contexts of equal probability. Each trial
consists of H = 100 episodes of length T = 2000. The
results were averaged over 20 experiments. The parameter
$T_{ec}$ sets the portion of the trajectory time steps dedicated
to identifying the model, and was taken to be p · T, p = 0.3.
The learning policy employed by Algorithm 2 was taken
to be uniform over all actions. The exploitation algorithm
used is Q-learning (Bertsekas & Tsitsiklis, 1995).
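The episode structure described above (a uniform identification phase followed by exploitation) can be sketched as below. This is a generic illustration and not the authors' implementation: the function names are hypothetical, the tabular Q-learning update is the textbook one, and the `step_size` and `discount` values are assumptions, since the paper does not report them.

```python
import random


def identification_steps(T, p):
    """Number of time steps T_ec = p * T reserved for identifying the model."""
    return round(p * T)


def choose_action(t, T_ec, q_values_for_state, rng):
    """Uniform action while t < T_ec (identification phase),
    greedy with respect to the Q-values afterwards (exploitation phase)."""
    num_actions = len(q_values_for_state)
    if t < T_ec:
        return rng.randrange(num_actions)
    return max(range(num_actions), key=q_values_for_state.__getitem__)


def q_learning_update(Q, s, a, reward, s_next, step_size=0.1, discount=0.95):
    """Standard tabular Q-learning update, applied in place to Q[s][a].
    step_size and discount are assumed values, not taken from the paper."""
    target = reward + discount * max(Q[s_next])
    Q[s][a] += step_size * (target - Q[s][a])
```

With T = 2000 and p = 0.3, the first 600 steps of each episode are spent on identification, which is the source of the persistent gap to the optimal value discussed below.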

We performed four experiments where in each of the exper¬ 
iments all the parameters excluding one were fixed. The 
average reward throughout the experiment is measured. 
The results are presented in Figure 2. On the top-left and 
bottom-right plots we can see how CECE behaves as the 
number of episodes and trajectory length increase. As more 
data are available, the average reward increases since the 
clustering phase performs better and the models are better 
learned. Similarly, the average reward decreases as more 
models are introduced (top-right plot) since it is harder to 
cluster and learn each model. Notice that for a constant
proportion of identification steps there will always be a
gap between the optimal and the achieved value, due to the
identification phase.


Figure 2. Experiment 2. Recoverable axis labels: η - Ratio of learning; T - Length of each episode.


An interesting result is presented in the bottom-left plot.
The parameter $\eta = T_{ec}/T$ describes the portion of samples
taken to identify the correct model. The resulting plot
represents the exploration-exploitation trade-off for our
suggested model: how many samples are used to identify the
correct model versus how many are used to optimize the
CMDP.

5. Conclusions and Future Work 

In this work we presented a new framework for model¬ 
ing multiple Markovian sources with sequential decision 
making. While our models can be encompassed in exist¬ 
ing models (e.g., POMDPs; Aberdeen 2003) the proposed 
setup offers much flexibility in modeling both observable 
and latent static context while maintaining computational
tractability. We demonstrated that under certain conditions
one can overcome two fundamental problems: (1) learning 
the model parameters, and (2) optimizing on-line the action 
within an RL framework. We suggested and analyzed basic 
algorithms when the number of contexts is finite. 

This paper is but a first step in developing the contextual 
MDPs framework. Since CECE is a modular solution its 
performance can be improved by independent upgrades to 
its building blocks, such as: 

1. The clustering techniques we used are somewhat
inefficient and do not consider the confidence of each
trajectory.

2. Data- and model-dependent learning policies could
possibly classify the trajectory in fewer steps.

3. Reward-oriented context classification can lead to
improved overall regret.

4. Incorporating context exploration in the exploitation
phase hedges against misclassification.

There are other schemes to solve CMDPs. A rather similar
approach is combining the Exploration-Classification-Exploitation
steps to form a belief over models and solve
accordingly (like MMRL; Doya et al. 2002). Another
reasonable approach when there is some distribution over
contexts is to model the problem as a POMDP (Aberdeen,
2003), and then learn and optimize it. Finally, it is
possible to view the optimization problem as a robust MDP
(Nilim & El Ghaoui 2005), where the uncertainty is over which
model the data come from. While solving the resulting
robust MDPs directly is computationally hard, a rectangular
relaxation could possibly be used to provide an approximate
result; one future direction is to investigate this
approximation.

The concurrent RL setup (Silver et al., 2013), as well as 
the case of many or even infinitely many contexts are of 
practical importance. We have presented rough ideas on 
how to pursue these, but the exact theoretical setup requires 
a more precise definition (what guarantees could be made, 
what assumptions must hold and so on). 

The issues of computational efficiency and sample com¬ 
plexity are important and were not tackled in this pa¬ 
per. Despite the availability of big data in many appealing 
venues, the state, action and context spaces may scale ac¬ 
cordingly. Hence, an interesting theoretical and practical 
concern is the error and regret rates for finite sample size; 
finding these requires a more subtle analysis and is left for 
future work. 

Subsequently, for very large state or action spaces, straight¬ 
forward implementation of the model-based approach will 
fail as the number of samples required to learn the model
grows accordingly. Solving this problem within the CMDP 
framework may introduce some intriguing connections. 
For example, if the linear function approximation tech¬ 
nique is used (Sutton & Barto, 1998), the problem of clus¬ 
tering same-policy trajectories corresponds to the subspace 
clustering problem (Vidal, 2010). 

In conclusion, from both the algorithmic and the analytic
points of view, the theoretical trade-off between learning,
exploration, optimization, and control of CMDPs is still
very much an open question.
A. List of Notations

Notation       Meaning
S              State space or number of states
A              Action space or number of actions
T              Time horizon
t              Time index, t = 0..T
H              Number of trajectories in batch data
$H_l$          Number of trajectories in the l-th mini-batch
C              Number of possible contexts
$J^{\mu}_{M}$  Value of policy $\mu$ in model M
D              Minimal inf-distance between two distinct models


B. Useful Lemmas 

The following Lemmas are used in the proofs: 

Lemma 2. (Weissman et al., 2003) Let P be a probability distribution on the set $\mathcal{S} = \{1, \dots, S\}$. Let $X^m = X_1, X_2, \dots, X_m$
be independent identically distributed random variables distributed according to P. Then for all $\epsilon > 0$,

$\Pr(\|P - \hat{P}_{X^m}\|_1 \geq \epsilon) \leq e^{S - m\epsilon^2/2}$ (9)
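As a numerical illustration of how Lemma 2 is used, the sketch below evaluates the bound in the form $e^{S - m\epsilon^2/2}$ and inverts it to obtain a sufficient number of samples. This form is a reading of the lemma that is implied by the original Weissman et al. bound $(2^S - 2)e^{-m\epsilon^2/2}$, since $2^S - 2 < e^S$. The function names are hypothetical.

```python
import math


def l1_deviation_bound(num_states, num_samples, eps):
    """Upper bound exp(S - m * eps^2 / 2) on the probability that the empirical
    distribution deviates from the true one by at least eps in L1 norm
    (clipped at 1, since it bounds a probability)."""
    return min(1.0, math.exp(num_states - num_samples * eps ** 2 / 2))


def samples_for_confidence(num_states, eps, delta):
    """Smallest integer m with exp(S - m * eps^2 / 2) <= delta,
    i.e. m >= 2 * (S + ln(1 / delta)) / eps^2."""
    return math.ceil(2 * (num_states + math.log(1 / delta)) / eps ** 2)
```

The required sample size grows linearly in the number of states and quadratically in $1/\epsilon$, which is consistent with the long trajectories needed before the clustering in Experiment 1 succeeds.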

Lemma 3. (Kearns & Singh, 2002) Let M be an MDP over S states, and $\hat{M}$ be an $O(\epsilon)$-approximation of M. Then for
any policy $\mu$:

( 10 ) 

and consequently for the optimal policy in each MDP correspondingly: 

I Jm - (11) 


C. Proof of Lemma 1 

Lemma 1. If Assumption 4 holds, the described realization of Algorithms 1-4 satisfies Assumptions 1-3 with:

e{H) G 0{KSAe^-°‘^^^), 

SiiH) G 

(52(e) G D>2e 

C(e) G 0{S^Th). 

Proof. We show that each assumption holds, starting with Assumption 1.

For two transition functions $P_1, P_2$ of size $S \times S \times A$, denote:

$\|P_1 - P_2\| = \max_{s,a} \|P_1(\cdot|s,a) - P_2(\cdot|s,a)\|$. (13)
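The distance in (13) can be computed directly from two transition tensors. The sketch below is illustrative only: it assumes the inner norm is the $L_1$ norm (the proof later refers to the $L_1$ distance between distributions), stores a model as nested lists indexed as `P[s][a][s_next]`, and uses a hypothetical function name.

```python
def model_distance(P1, P2):
    """Maximum over all (state, action) pairs of the L1 distance between
    the two next-state distributions P1[s][a] and P2[s][a]."""
    return max(
        sum(abs(x - y) for x, y in zip(dist1, dist2))
        for rows1, rows2 in zip(P1, P2)
        for dist1, dist2 in zip(rows1, rows2)
    )
```

Because each inner term is an $L_1$ distance between two probability distributions, the value always lies between 0 and 2.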

We denote by $\hat{P}_h$, $\hat{P}_c$ the estimated transition matrices from trajectory h and cluster c correspondingly. In addition, $C^*$
is the true clustering of each trajectory, and $C^{opt}$ is the clustering found by the algorithm.

Since there are at least $\alpha T$ samples from each state-action pair, according to Lemma 2 and the union bound, we obtain
that:

$\Pr(\|\hat{P}_h(\cdot|s,a) - P_{C^*(h)}(\cdot|s,a)\| < \epsilon) > 1 - SAe^{-\alpha T\epsilon^2/2}$ (14)

Since there are at least $\beta H$ trajectories from each model, we also obtain that:

$\Pr(\|\hat{P}_{C^*(h)}(\cdot|s,a) - P_{C^*(h)}(\cdot|s,a)\| < \epsilon) > 1 - SAe^{-\alpha T\beta H\epsilon^2/2}$ (15)
and therefore:


H 


Pr(^ ||Prc.(^.)(-|s,a) - Prc.(„)(-|s, a)|l <He)>l- l\ 


(16) 


^=1 


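Now we obtain the following:

$$\begin{aligned}
\sum_{h=1}^{H} \|P_{C^*(h)} - \hat{P}_{C^{opt}(h)}\|
&\leq \sum_{h=1}^{H} \|\hat{P}_h - P_{C^*(h)}\| + \sum_{h=1}^{H} \|\hat{P}_h - \hat{P}_{C^{opt}(h)}\| && \text{Triangle inequality} \\
&\leq \sum_{h=1}^{H} \|\hat{P}_h - P_{C^*(h)}\| + \sum_{h=1}^{H} \|\hat{P}_h - \hat{P}_{C^*(h)}\| && \text{By Algorithm definition} \\
&\leq 2\sum_{h=1}^{H} \|\hat{P}_h - P_{C^*(h)}\| + \sum_{h=1}^{H} \|\hat{P}_{C^*(h)} - P_{C^*(h)}\| && \text{Triangle inequality (second term).}
\end{aligned}$$ (17)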


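When H is large, we can approximate $2\sum_{h=1}^{H} \|\hat{P}_h - P_{C^*(h)}\| \in O(H(1-\delta)\epsilon + H\delta) = O(H\epsilon + H\delta)$ for $\delta = SAe^{-\alpha T\epsilon^2/2}$,
since each summand is bounded by $\epsilon$ with that probability, and when it is not, the maximal value of the $L_1$ distance
between two distributions is a constant 2. Therefore:

$\frac{1}{H}\sum_{h=1}^{H} \|P_{C^*(h)} - \hat{P}_{C^{opt}(h)}\| \in O(\epsilon + \delta)$ (18)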


with probability at least 1 — iCS'Ae® for 5 = 5'Ae'® 

Since the average is of that order, there must exist a matching between the true clusters and optimal clusters satisfying:

$\max_{c \in C^*} \|P_c - \hat{P}_{c^{opt}}\| \in O(\epsilon + \delta)$ (19)

If the distance between every two true clusters is $D > O(\epsilon + \delta)$, matching clusters agree on all
trajectories within a reasonable radius, i.e. on an $O(1 - \delta)$ fraction of the trajectories, so the error in each model is of the order $O(K\delta)$:


|lPrc(-|s, a) - Pri(-|s,a)|| < KSAe^ aTAl 2 ^ 
Now in order for D > 0(e + J) to hold, we can choose e to be of order D, and then: 

e 0(D) ^ P G 0(^ 


( 20 ) 


( 21 ) 


To summarize, for T G log(;g^)) we obtain that with probability at least 1 — (5(P), ||Prc(-|s, a) — Pri(-|s, a)|| < 

e(i?), where: 

e(H) G 5(H) G 0(KSAe^-‘^^^^^'). (22) 

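Next, we show Assumption 2 holds. We bound the probability of misclassification by the following probability:

$\Pr\left(\|\hat{P}_h - \hat{P}_{c(h)}\| < \frac{D}{2} \wedge \|\hat{P}_h - \hat{P}_{c \neq c(h)}\| > \frac{D}{2}\right)$, (23)

since if this event occurs then the true model will be chosen. To bound this quantity, we use the union bound over the
complement event, so we need to bound:

$\Pr\left(\|\hat{P}_h - \hat{P}_{c(h)}\| > \frac{D}{2}\right), \quad \Pr\left(\|\hat{P}_h - \hat{P}_{c \neq c(h)}\| < \frac{D}{2}\right).$

For the left term:

$\Pr\left(\|\hat{P}_h - \hat{P}_{c(h)}\| > \frac{D}{2}\right) \leq \Pr\left(\|\hat{P}_h - P_{c(h)}\| + \|P_{c(h)} - \hat{P}_{c(h)}\| > \frac{D}{2}\right)$ (24)

$\leq \Pr\left(\|\hat{P}_h - P_{c(h)}\| > \frac{D}{2} - \epsilon\right)$ (25)

$\leq e^{S - T_{ec}(\frac{D}{2} - \epsilon)^2/2}$ (Lemma 2)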
For the right term:

Pr{\\Ph - Pc^C(h)\\ ^ y) ^ Pr(ll-P/i - Pc{h)\\ - \\Pc^C(h) - Pc(h)\\ - \\Pc^C{h) - Pc^C(h)\\ < y) 

< Prdl-P/i - Pc{h)\\ < y + e + -D) 

< gS-TEc(V+'^) Lemma 2 


Now, using the union bound we obtain that the classification is correct with probability at least 1 — (5, where 

g ^ ^S-TEc{§-e)V2 ^gS-TEc(3^+e)V2 


(27) 

□ 
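The classification event analyzed for Assumption 2 can be illustrated with a short sketch. It assumes that classification assigns a partial trajectory to the nearest learned model in the distance of (13), with the $L_1$ norm as the inner norm; this matches the argument that the true model is chosen whenever it lies within D/2 and every other model lies farther than D/2, but it is an assumption rather than a transcription of Algorithm 3. All function names are hypothetical.

```python
def model_distance(P1, P2):
    """Max over (state, action) of the L1 distance between next-state distributions."""
    return max(
        sum(abs(x - y) for x, y in zip(dist1, dist2))
        for rows1, rows2 in zip(P1, P2)
        for dist1, dist2 in zip(rows1, rows2)
    )


def classify_trajectory(P_hat_h, cluster_models):
    """Index of the learned cluster model nearest to the trajectory estimate."""
    distances = [model_distance(P_hat_h, P_c) for P_c in cluster_models]
    return min(range(len(distances)), key=distances.__getitem__)


def margin_event_holds(P_hat_h, cluster_models, true_index, D):
    """True when the estimate is closer than D/2 to its own model and farther
    than D/2 from every other model, so nearest-model classification is correct."""
    for index, P_c in enumerate(cluster_models):
        distance = model_distance(P_hat_h, P_c)
        if index == true_index and not distance < D / 2:
            return False
        if index != true_index and not distance > D / 2:
            return False
    return True
```

Whenever `margin_event_holds` returns True for the true index, `classify_trajectory` returns that index, which is the implication used at the start of the argument for Assumption 2.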



